Conversation

@kareemshaik80

- Support for a single sink logit in flash attention decode
- Add sink to softmax
- Command-line flag added to enable the attention sink

Signed-off-by: kareem <[email protected]>
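
For context: the sink here is one extra learnable logit that joins the softmax denominator without a corresponding V row, so the attention weights over the real keys sum to less than one. A minimal NumPy sketch of that math (function and variable names are illustrative, not from this PR):

```python
import numpy as np

def sink_softmax(logits, sink):
    """Numerically stable softmax with one extra sink logit: the sink
    enlarges the denominator but has no value row, so the returned
    weights over the real keys sum to less than 1."""
    m = max(logits.max(), sink)              # include the sink in the max
    e = np.exp(logits - m)
    return e / (e.sum() + np.exp(sink - m))
```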
@kareemshaik80 kareemshaik80 marked this pull request as draft September 25, 2025 07:38
@kareemshaik80 kareemshaik80 marked this pull request as ready for review September 25, 2025 07:39

@yuankuns yuankuns left a comment

Also needs a paper/code reference to confirm that this PR does what is intended.


@yuankuns yuankuns left a comment

Not changed.

@kareemshaik80
Author

> Also needs a paper/code reference to confirm that this PR does what is intended.

You can refer to this paper (StreamingLLM, which introduced attention sinks): https://arxiv.org/pdf/2309.17453

Eager code: https://github.com/huggingface/transformers/blob/caa14e7dabb086f167c14b7eecadc2ba9db25eb6/src/transformers/models/gpt_oss/modeling_gpt_oss.py#L258
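
Roughly, that eager path treats the per-head sink as one extra key column: it concatenates the sink to the attention logits, softmaxes over the extended axis, and drops the sink column before the P @ V product. A PyTorch sketch of the pattern (shapes and names are illustrative; see the link above for the exact code):

```python
import torch
import torch.nn.functional as F

def eager_sink_attention_weights(attn_logits, sinks):
    # attn_logits: [batch, heads, q_len, k_len]; sinks: [heads]
    b, h, q, _ = attn_logits.shape
    sink_col = sinks.reshape(1, h, 1, 1).expand(b, h, q, 1)
    combined = torch.cat([attn_logits, sink_col], dim=-1)
    probs = F.softmax(combined, dim=-1)  # softmax over keys + sink
    return probs[..., :-1]               # drop the sink column before P @ V
```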

@kareemshaik80 kareemshaik80 requested a review from yuankuns October 7, 2025 04:01
Author

@kareemshaik80 kareemshaik80 left a comment

Move test under unit tests.

@Antonyvance

@kareemshaik80 I believe this implementation needs to change based on PR #547.

@Antonyvance Antonyvance added the "redesign required" label Oct 17, 2025
@sunjiweiswift

sunjiweiswift commented Oct 22, 2025

The calculation here is incorrect.

  1. We use exp2, so the sink logit also needs to be multiplied by log2(e) (constexpr double kLog2e = 1.4426950408889634074);

  2. In online softmax, it's more appropriate to process the sink in stage 2, as is done in Triton (https://github.com/openai/gpt-oss/blob/0a9ec7f69d8aa71841c5cefcd84a512344b9f1be/gpt_oss/triton/attention.py#L94C4-L100C46). Introducing the sink in stage 1 is incorrect; it will change the GEMM results for V. See the sketch after this comment.

I've completed this functionality in CUTLASS: f709e32

It is recommended to use a more stringent unit test to check the results.
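
A minimal NumPy sketch of the stage-2 handling described in point 2, mirroring the Triton reference: exp2 with the log2(e) scale, and the sink folded into the running max and normalizer only after the V-accumulation loop, so it never enters the P @ V GEMM (names are illustrative):

```python
import numpy as np

LOG2E = 1.4426950408889634074  # kLog2e: exp(x) == exp2(x * LOG2E)

def decode_attention_with_sink(scores, v, sink):
    # scores: [k_len] logits for one query row; v: [k_len, d]; sink: scalar
    m, l = -np.inf, 0.0                       # running max and normalizer
    acc = np.zeros(v.shape[1])
    for s, vi in zip(scores, v):              # stage 1: online softmax over K/V
        m_new = max(m, s)
        alpha = 2.0 ** ((m - m_new) * LOG2E)  # rescale old stats to the new max
        p = 2.0 ** ((s - m_new) * LOG2E)
        l = l * alpha + p
        acc = acc * alpha + p * vi
        m = m_new
    # stage 2: fold the sink into the max/normalizer only -- acc is untouched
    m_fin = max(m, sink)
    alpha = 2.0 ** ((m - m_fin) * LOG2E)
    l = l * alpha + 2.0 ** ((sink - m_fin) * LOG2E)
    return (acc * alpha) / l
```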
